\(
\newcommand{\water}{{\rm H_{2}O}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\Z}{\mathbb{Z}}
\newcommand{\Q}{\mathbb{Q}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\d}{\mathop{}\!\mathrm{d}}
\newcommand{\grad}{\nabla}
\newcommand{\T}{^\text{T}}
\newcommand{\mathbbone}{\unicode{x1D7D9}}
\renewcommand{\:}{\enspace}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator{\Tr}{Tr}
\newcommand{\norm}[1]{\lVert #1\rVert}
\newcommand{\KL}[2]{ \text{KL}\left(\left.\rule{0pt}{10pt} #1 \; \right\| \; #2 \right) }
\newcommand{\slashfrac}[2]{\left.#1\middle/#2\right.}
\)
Often, we encounter objective functions that are expectations (for instance, in VAEs [1]):
\[
\mathcal{L}(\theta, \phi) = \E_{q_\phi(z)}\big[ \, f_\theta(z) \, \big]
\]
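For concreteness, here is a minimal sketch of estimating such an objective by Monte Carlo. The setup is purely hypothetical, chosen only for illustration: \(\; q_\phi(z) = \mathcal{N}(\mu, \sigma^2) \;\) with \(\; \phi = (\mu, \sigma) \;\), and \(\; f_\theta(z) = \theta z^2 \;\).

```python
import numpy as np

rng = np.random.default_rng(0)

def f(z, theta):
    # Hypothetical integrand f_theta(z); any function of z and theta would do.
    return theta * z ** 2

def monte_carlo_objective(theta, mu, sigma, num_samples=10_000):
    # L(theta, phi) = E_{q_phi(z)}[f_theta(z)] with q_phi(z) = N(mu, sigma^2),
    # estimated by averaging f over samples drawn from q_phi.
    z = rng.normal(mu, sigma, size=num_samples)
    return f(z, theta).mean()

# E[theta * z^2] = theta * (mu^2 + sigma^2), so this should print roughly 2.0.
print(monte_carlo_objective(theta=2.0, mu=0.0, sigma=1.0))
```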
In order to minimize the objective function, we need to take gradients of this expectation with respect to both sets of parameters, \(\; \theta \;\) and \(\; \phi \;\).
-
It is easy to take the gradient with respect to the function parameters \(\; \theta \;\): the distribution we take the expectation over, \(\; q_\phi(z) \;\), does not depend on them, so we can move the gradient inside the expectation:
\[
\nabla_\theta \, \mathcal{L}(\theta, \phi) \; = \; \nabla_\theta \, \E_{q_\phi(z)}\big[ \, f_\theta(z) \, \big] \; = \; \E_{q_\phi(z)}\big[ \, \nabla_\theta \, f_\theta(z) \, \big]
\]
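Continuing the toy example above (again, with the hypothetical choices \(\; q_\phi = \mathcal{N}(\mu, \sigma^2) \;\) and \(\; f_\theta(z) = \theta z^2 \;\)), an unbiased estimate of \(\; \nabla_\theta \, \mathcal{L} \;\) is the sample average of \(\; \nabla_\theta \, f_\theta(z) = z^2 \;\):

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_theta_f(z, theta):
    # For the toy choice f_theta(z) = theta * z^2, the gradient w.r.t. theta is z^2.
    return z ** 2

def monte_carlo_grad_theta(theta, mu, sigma, num_samples=10_000):
    # grad_theta L = E_{q_phi(z)}[grad_theta f_theta(z)]: sample z ~ q_phi and average.
    z = rng.normal(mu, sigma, size=num_samples)
    return grad_theta_f(z, theta).mean()

# True value: d/dtheta [theta * (mu^2 + sigma^2)] = mu^2 + sigma^2 = 1.0 here.
print(monte_carlo_grad_theta(theta=2.0, mu=0.0, sigma=1.0))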
-
However, taking the gradient with respect to the distributional parameters \(\; \phi \;\) is more difficult: the distribution \(\; q_\phi(z) \;\) we are averaging over depends on them, so we cannot simply move the gradient inside the expectation:
\[
\nabla_\phi \, \mathcal{L}(\theta, \phi) \; = \; \nabla_\phi \, \E_{q_\phi(z)}\big[ \, f_\theta(z) \, \big]
\]
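To see why naively pushing the gradient inside would go wrong here, take the same hypothetical toy setup (\(\; q_\phi = \mathcal{N}(\mu, \sigma^2) \;\), \(\; f_\theta(z) = \theta z^2 \;\)): \(\; f \;\) itself does not depend on \(\; \mu \;\), so \(\; \E_{q_\phi}\big[ \nabla_\mu f_\theta(z) \big] = 0 \;\), while the true gradient \(\; \nabla_\mu \, \theta(\mu^2 + \sigma^2) = 2\theta\mu \;\) is nonzero:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, mu, sigma = 2.0, 0.5, 1.0

# Wrong: pushing the gradient inside gives E_q[d f / d mu] = 0, because
# f_theta(z) = theta * z^2 does not depend on mu -- only the sampling distribution does.
z = rng.normal(mu, sigma, size=100_000)
naive_grad_mu = np.zeros_like(z).mean()

# Right: E_{N(mu, sigma^2)}[theta * z^2] = theta * (mu^2 + sigma^2),
# so the true gradient w.r.t. mu is 2 * theta * mu.
true_grad_mu = 2 * theta * mu

print(naive_grad_mu, true_grad_mu)  # 0.0 vs 2.0
```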
The strategy for computing the gradient w.r.t. \(\; \phi \;\) is to convert this ugly expression into an expectation (so that we can estimate it by drawing several samples and averaging, i.e. with a Monte Carlo approximation). Two possible ways to do this are:
[1] Kingma and Welling, 2014. Auto-Encoding Variational Bayes.